Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Consolidates CI build/push into an application+stage matrix with a separate deploy_images job, removes scheduled sync/teardown triggers, renames trainer/training → runner/model-trainer across infra/storage/compute/exports, refactors TiDE training into `workflow.train_model`, and updates Maskfile, Dockerfiles, tests, and docs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Developer
    participant GHA as "GitHub Actions"
    participant Matrix as "workflow matrix"
    participant Mask as "Mask / Flox"
    participant ECR as "AWS ECR"
    participant ECS as "AWS ECS"
    Developer->>GHA: push / workflow_dispatch
    GHA->>Matrix: expand application+stage matrix
    Matrix->>Mask: run build (package, stage)
    Mask->>ECR: check image by commit-hash
    alt image exists
        Mask-->>Matrix: skip build
    else image missing
        Mask->>Mask: build image (package-stage)
        Mask->>ECR: push image (package-stage)
    end
    GHA->>GHA: deploy_images job triggered (matrix)
    GHA->>ECS: deploy service using matrix.application/matrix.stage
    ECS->>ECS: wait for service stability
    GHA->>Developer: workflow complete
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Greptile Summary

This PR consolidates Docker image handling for TiDE model training (server + worker now share a single ECR repo and Dockerfile) and renames ECS services, task families, and log groups to a consistent "-server"/"-worker" convention.

Key changes:
Confidence Score: 1/5
| Filename | Overview |
|---|---|
| infrastructure/compute.py | Renames all ECS services, task families, log groups, and service-discovery entries to the new "-server"/"-worker" convention; consolidates server+worker into one shared image URI — but the server task definition is missing an entryPoint override, causing the Prefect server to fail to start (see P0 comment). |
| infrastructure/storage.py | Reverts S3 buckets from BucketV2 back to Bucket, fixes ECR repo names to use hyphens consistently, consolidates two separate training ECR repos into a single fund/model-trainer-server-worker repo, and fixes the lifecycle policy countNumber from 0 to 1 (preventing immediate expiry of untagged images). |
| models/Dockerfile | Renamed from tools/Dockerfile; renames the final stage from worker to server-worker and copies models/tide/ into the build. The ENTRYPOINT remains the Prefect worker command, which conflicts with the server ECS task definition that only sets command (see P0 comment on compute.py). |
| .github/workflows/launch_infrastructure.yaml | Removes scheduled launch trigger and adds a matrix-based deploy_images job that runs per-service ECS force-new-deployment after Pulumi completes; adds model-trainer as a build/push matrix entry and splits server/worker as separate deploy entries. |
| maskfile.md | Renames images → image and services → service subcommands, adds a new deploy command that runs per-service ECS deployments, and fixes the build command to map hyphenated package names to underscore directory names when resolving Dockerfile paths. |
| models/tide/src/tide/workflow.py | Absorbs the deleted trainer.py module by inlining train_model directly into workflow.py with deferred imports for tinygrad-heavy dependencies; updates the train_tide_model task to call the local function instead of the removed module import. |
| models/tide/src/tide/deploy.py | Restricts the TiDE training schedule to weekdays only (0 22 * * 1-5) and adds timezone="America/New_York" so the cron fires at 10 PM ET rather than UTC. |
Comments Outside Diff (1)
infrastructure/compute.py, line 495-507 (link): **Server starts worker process instead of Prefect server**

`models/Dockerfile` sets the image-level `ENTRYPOINT` to the worker command:

```
ENTRYPOINT ["uv", "run", "--package", "tools", "prefect", "worker", "start", "--pool", "training-pool", "--type", "process"]
```

The server container definition here only overrides `command` (Docker CMD), not `entryPoint`. In Docker/ECS, `command` is appended to the existing `ENTRYPOINT` rather than replacing it. The server container would therefore execute:

```
uv run --package tools prefect worker start --pool training-pool --type process bash -c "..."
```

`prefect worker start` would receive `bash` and `-c "..."` as extra positional arguments and error out. The Prefect server would never start, causing the ECS service to crash-loop and the ALB health check on port 4200 to fail.

The fix is to add an `entryPoint` override that resets the Docker ENTRYPOINT so the `command` can run `bash -c "..."` directly:

```python
"entryPoint": ["bash", "-c"],
"command": [
    (
        "export PREFECT_API_DATABASE_CONNECTION_URL="
        '$(python3 -c "'
        "import os, urllib.parse;"
        "p=urllib.parse.quote(os.environ['PREFECT_DB_PASSWORD'],safe='');"
        f"print(f'postgresql+asyncpg://prefect:{{p}}@{args[1]}/prefect')"
        '")'
        " && prefect server start --host 0.0.0.0"
    )
],
```

Alternatively, change `models/Dockerfile` to use `CMD` instead of `ENTRYPOINT` for the default worker command so ECS `command` can fully replace it.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: infrastructure/compute.py
Line: 495-507
Comment:
**Server starts worker process instead of Prefect server**
`models/Dockerfile` sets the image-level `ENTRYPOINT` to the worker command:
```
ENTRYPOINT ["uv", "run", "--package", "tools", "prefect", "worker", "start", "--pool", "training-pool", "--type", "process"]
```
The server container definition here only overrides `command` (Docker CMD), not `entryPoint`. In Docker/ECS, `command` is **appended** to the existing `ENTRYPOINT` rather than replacing it. The server container would therefore execute:
```
uv run --package tools prefect worker start --pool training-pool --type process bash -c "..."
```
`prefect worker start` would receive `bash` and `-c "..."` as extra positional arguments and error out. The Prefect server would never start, causing the ECS service to crash-loop and the ALB health check on port 4200 to fail.
The fix is to add an `entryPoint` override that resets the Docker ENTRYPOINT so the `command` can run `bash -c "..."` directly:
```python
"entryPoint": ["bash", "-c"],
"command": [
(
"export PREFECT_API_DATABASE_CONNECTION_URL="
'$(python3 -c "'
"import os, urllib.parse;"
"p=urllib.parse.quote(os.environ['PREFECT_DB_PASSWORD'],safe='');"
f"print(f'postgresql+asyncpg://prefect:{{p}}@{args[1]}/prefect')"
'")'
" && prefect server start --host 0.0.0.0"
)
],
```
Alternatively, change `models/Dockerfile` to use `CMD` instead of `ENTRYPOINT` for the default worker command so ECS `command` can fully replace it.
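The appended-vs-replaced behavior driving this bug can be sketched as a toy Python model of how Docker/ECS assemble a container's final argv (this is an illustration of the rule, not ECS code; the Prefect server start string is simplified from the suggestion above):

```python
# Toy model of the Docker/ECS rule: the container runs ENTRYPOINT followed by
# CMD ("command" in an ECS container definition); CMD never replaces ENTRYPOINT.

def final_argv(entrypoint: list[str], command: list[str]) -> list[str]:
    """Return the argv a container would run: ENTRYPOINT then CMD appended."""
    return entrypoint + command


worker_entrypoint = [
    "uv", "run", "--package", "tools",
    "prefect", "worker", "start", "--pool", "training-pool", "--type", "process",
]
server_command = ["bash", "-c", "prefect server start --host 0.0.0.0"]

# Without an entryPoint override, the server's command is tacked onto the
# worker ENTRYPOINT, so `prefect worker start` receives stray arguments:
print(final_argv(worker_entrypoint, server_command))

# Resetting entryPoint to ["bash", "-c"] makes the command run as intended:
print(final_argv(["bash", "-c"], ["prefect server start --host 0.0.0.0"]))
```

The first print shows the broken argv the review describes (worker command plus trailing `bash -c ...`); the second shows the argv after the suggested `entryPoint` override.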
How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "Add fixes after manu..."
Pull request overview
This PR refactors model-training and infrastructure tooling to consolidate Docker/ECR handling, align ECS/Pulumi naming, and simplify Prefect-related workflows/resources.
Changes:
- Moves TiDE `train_model` (+ configuration) into `tide.workflow` and relocates tests accordingly.
- Renames/standardizes infrastructure resources (ECR repos, ECS services, service discovery names) around the new "model-trainer" and "*-server" naming.
- Reworks `maskfile.md` and GitHub Actions to build/push/deploy images via unified commands and updated service naming.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| models/tide/tests/test_workflow.py | Adds train_model tests and updates patch targets after moving training logic. |
| models/tide/tests/test_trainer.py | Removes redundant trainer-module tests (now covered in workflow tests). |
| models/tide/tests/test_deploy.py | Updates expected Prefect cron schedule string. |
| models/tide/src/tide/workflow.py | Introduces DEFAULT_CONFIGURATION + train_model directly in workflow module; removes trainer import in flow. |
| models/tide/src/tide/trainer.py | Deletes legacy trainer module (functionality moved). |
| models/tide/src/tide/deploy.py | Updates Prefect deployment schedule and sets a timezone. |
| models/tide/Dockerfile | Renames CUDA stage from trainer to runner. |
| models/Dockerfile | Renames the final stage to server-worker. |
| maskfile.md | Consolidates image commands, adds ECS deploy command, updates model deploy/train/download commands. |
| libraries/python/tests/test_infrastructure_storage.py | Updates assertions for renamed ECR lifecycle resources. |
| infrastructure/storage.py | Renames ECR repositories and consolidates model-trainer to a single repo. |
| infrastructure/github_environment_runbook.md | Simplifies runbook instructions (bootstrap now sets secrets). |
| infrastructure/compute.py | Renames ECS/ALB/SG/RDS/Redis resources and switches server/worker to shared model-trainer image URI. |
| infrastructure/main.py | Updates Pulumi exports for renamed ECR resources and new Prefect API URL. |
| .github/workflows/teardown_infrastructure.yaml | Removes scheduled teardown trigger. |
| .github/workflows/sync_data.yaml | Removes repository-dispatch-based data sync workflow. |
| .github/workflows/launch_infrastructure.yaml | Updates matrix to build/push/deploy per app-stage and removes schedule + sync trigger. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
infrastructure/compute.py (1)
363-380: ⚠️ Potential issue | 🔴 Critical

This renames the live Prefect database instead of migrating it.
Changing both the Pulumi logical resource name (from `"prefect_database"` to `"model_trainer_state"`) and the RDS identifier (from `"fund-prefect"` to `"fund-model-trainer-state"`) creates a new instance, not an in-place rename. The old database `fund-prefect` will be orphaned because Pulumi will attempt to delete it, and Pulumi's own `deletion_protection=True` flag on the new resource does not affect the old resource. Depending on how Pulumi handles this drift, you risk losing the existing Prefect state or creating deployment failures. Please migrate using a snapshot/restore process, or revert the logical name and identifier back to their original values to perform an in-place update.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@infrastructure/compute.py` around lines 363 - 380, You renamed the Pulumi logical resource and the RDS identifier (new resource model_trainer_state with identifier "fund-model-trainer-state"), which causes Pulumi to create a new RDS instance and orphan the existing "fund-prefect" DB; to fix, either revert the Pulumi logical name and identifier back to the original values (restore the resource name prefect_database and identifier "fund-prefect") so Pulumi will perform an in-place update, or implement an explicit snapshot/restore migration: take a snapshot of the existing instance, restore that snapshot into the desired new instance identifier, import/replace the Pulumi resource state as needed, and ensure deletion_protection/backups are configured so the original instance isn’t inadvertently destroyed; update the code around model_trainer_state, identifier, and any references to the original prefect_database/fund-prefect to match the chosen approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@infrastructure/compute.py`:
- Around line 476-480: The server and worker task definitions both use the
shared variable model_trainer_server_worker_image_uri, causing image race
conditions; change the deployment to use distinct image identifiers (e.g.,
model_trainer_server_image_uri and model_trainer_worker_image_uri) or pin
immutable digests for each stage and replace
model_trainer_server_worker_image_uri in container_definitions and the other
occurrence (the similar container_definitions block around the second mention)
with the appropriate per-role image variable so server and worker always get the
correct image.
In `@maskfile.md`:
- Around line 213-214: The ECS waiter timeout is too short (Delay: 10,
MaxAttempts: 6 → 60s); update the aws ecs wait services-stable command to allow
up to ~5 minutes by increasing the waiter-config (e.g., keep "Delay": 10 and set
"MaxAttempts": 30 or set "Delay": 20 and "MaxAttempts": 15) so the ecs wait
services-stable --cluster "$cluster" --services "$service" uses the longer
timeout for stabilization.
In `@models/tide/src/tide/workflow.py`:
- Around line 41-45: The train_model signature currently types configuration as
plain dict | None; tighten it by declaring and using a more specific
configuration type (e.g., a TypedDict named ModelConfiguration or at minimum
dict[str, Any]) and update the parameter type to ModelConfiguration | None (or
dict[str, Any] | None) to match the shape seen in DEFAULT_CONFIGURATION; update
any imports (typing.TypedDict/Any) and adjust usages in train_model to use the
new type alias so type checkers can validate keys referenced inside the
function.
In `@models/tide/tests/test_deploy.py`:
- Around line 25-27: The test is missing an assertion for the new timezone
parameter: update the assertions in the test (where call_kwargs,
call_kwargs["parameters"], cron and LOOKBACK_DAYS are checked) to include assert
call_kwargs["parameters"]["timezone"] == "America/New_York" so the deployed
job's timezone is validated alongside cron, base_url, and lookback days.
In `@models/tide/tests/test_workflow.py`:
- Around line 171-179: The two tests test_train_model_returns_model_and_data and
test_train_model_uses_default_configuration call train_model() without
overriding epoch_count and thus depend on production defaults; update these
tests to either inject a cheap test configuration (e.g., pass epoch_count=1 or a
TestConfig object) into train_model or stub/mock the model.fit path so training
is cheap and deterministic—look for train_model, epoch_count and any internal
fit/training call to patch or parameterize, and make the same change for the
other affected tests (the ones around the 201-207 range) so all unit tests pin a
minimal training budget.
---
Outside diff comments:
In `@infrastructure/compute.py`:
- Around line 363-380: You renamed the Pulumi logical resource and the RDS
identifier (new resource model_trainer_state with identifier
"fund-model-trainer-state"), which causes Pulumi to create a new RDS instance
and orphan the existing "fund-prefect" DB; to fix, either revert the Pulumi
logical name and identifier back to the original values (restore the resource
name prefect_database and identifier "fund-prefect") so Pulumi will perform an
in-place update, or implement an explicit snapshot/restore migration: take a
snapshot of the existing instance, restore that snapshot into the desired new
instance identifier, import/replace the Pulumi resource state as needed, and
ensure deletion_protection/backups are configured so the original instance isn’t
inadvertently destroyed; update the code around model_trainer_state, identifier,
and any references to the original prefect_database/fund-prefect to match the
chosen approach.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8d146e9a-23e4-490e-93b7-9bce689ed45b
📒 Files selected for processing (17)
- .github/workflows/launch_infrastructure.yaml
- .github/workflows/sync_data.yaml
- .github/workflows/teardown_infrastructure.yaml
- infrastructure/__main__.py
- infrastructure/compute.py
- infrastructure/github_environment_runbook.md
- infrastructure/storage.py
- libraries/python/tests/test_infrastructure_storage.py
- maskfile.md
- models/Dockerfile
- models/tide/Dockerfile
- models/tide/src/tide/deploy.py
- models/tide/src/tide/trainer.py
- models/tide/src/tide/workflow.py
- models/tide/tests/test_deploy.py
- models/tide/tests/test_trainer.py
- models/tide/tests/test_workflow.py
💤 Files with no reviewable changes (4)
- .github/workflows/teardown_infrastructure.yaml
- .github/workflows/sync_data.yaml
- models/tide/src/tide/trainer.py
- models/tide/tests/test_trainer.py
Fix five issues identified in code review:
- maskfile build+push: use ${AWS_REGION:-} instead of ${AWS_REGION} so
set -euo pipefail does not abort before the friendly error check when
the variable is unset
- maskfile build+push: override repository_name to
fund/model-trainer-server-worker when package_name is model-trainer,
fixing mismatch between the single shared ECR repo in infrastructure
and the fund/${package_name}-${stage_name} derivation that would
produce non-existent repo names
- maskfile build: translate hyphens to underscores via tr when resolving
the Dockerfile path for application services, fixing broken path for
data-manager, portfolio-manager, and ensemble-manager whose directories
use underscores
- CI workflow: change model-trainer matrix paths from the comma-separated
string models/Dockerfile,models/__init__.py (treated as a single
literal glob by dorny/paths-filter, never matching) to models/**
- workflow.py: tighten configuration type hint from dict | None to
dict[str, Any] | None per project type hint guidelines
- test_deploy.py: add assertion for timezone parameter
- test_workflow.py: pass epoch_count=1 to two tests that previously ran
the full 20-epoch production schedule, keeping unit tests cheap
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
maskfile.md (1)
220-221: ⚠️ Potential issue | 🟡 Minor

Increase ECS waiter timeout to avoid flaky deploy failures.
Line 220 and Line 221 currently allow only ~60 seconds, which is commonly insufficient for ECS service stabilization during rolling updates.
⏱️ Suggested timeout adjustment

```diff
 aws ecs wait services-stable --cluster "$cluster" --services "$service" \
-  --waiter-config '{"Delay": 10, "MaxAttempts": 6}'
+  --waiter-config '{"Delay": 15, "MaxAttempts": 20}'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@maskfile.md` around lines 220 - 221, The ECS waiter timeout is too short for service stabilization; update the aws ecs wait services-stable invocation (the command using "$cluster" and "$service") to increase the waiter-config Delay/MaxAttempts values so the total wait is much longer (e.g., Delay 15 and MaxAttempts 40 for ~10 minutes) to avoid flaky deploy failures during rolling updates; modify the waiter-config JSON accordingly to reflect the new Delay and MaxAttempts.
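For context on the numbers in this waiter discussion, the total wait budget is simply Delay × MaxAttempts. A small sketch (the boto3 call at the end is illustrative only, since it needs live AWS credentials):

```python
# Waiter math behind the suggestion: total wait ≈ Delay * MaxAttempts seconds.

def waiter_config(delay_seconds: int, max_attempts: int) -> dict[str, int]:
    """Build the WaiterConfig dict that the AWS CLI/boto3 waiters accept."""
    return {"Delay": delay_seconds, "MaxAttempts": max_attempts}


def total_wait_seconds(config: dict[str, int]) -> int:
    """Upper bound on how long the waiter polls before giving up."""
    return config["Delay"] * config["MaxAttempts"]


current = waiter_config(10, 6)    # the flagged setting: only 60 s
proposed = waiter_config(15, 20)  # one suggested setting: 300 s (~5 minutes)

print(total_wait_seconds(current), total_wait_seconds(proposed))

# Equivalent boto3 usage (not executed here; requires AWS access):
# import boto3
# ecs = boto3.client("ecs")
# ecs.get_waiter("services_stable").wait(
#     cluster=cluster, services=[service], WaiterConfig=proposed
# )
```

Any Delay/MaxAttempts pair whose product comfortably exceeds a rolling update's typical stabilization time resolves the flakiness; the two suggested pairs in this thread (15×20 and 15×40) give 5 and 10 minutes respectively.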
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@models/tide/src/tide/workflow.py`:
- Around line 26-38: DEFAULT_CONFIGURATION is exported as a regular mutable dict
which can be accidentally mutated; replace it with an immutable mapping (e.g.,
wrap the literal with types.MappingProxyType or expose a
get_default_configuration() that returns a shallow copy) so callers cannot
change module-level defaults. Update the top-level symbol DEFAULT_CONFIGURATION
to be the immutable MappingProxyType of the existing dict (or replace usages to
call get_default_configuration()), import types.MappingProxyType if needed, and
ensure any tests or code that previously mutated DEFAULT_CONFIGURATION instead
obtain a mutable copy via dict(DEFAULT_CONFIGURATION) or the getter.
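The `MappingProxyType` suggestion above can be sketched as follows (the dict contents are illustrative; the real defaults live in `tide/workflow.py`):

```python
# Sketch: module-level defaults become read-only via MappingProxyType,
# while callers can still take a mutable copy when they need one.
from types import MappingProxyType

_DEFAULTS = {"epoch_count": 20}  # illustrative contents
DEFAULT_CONFIGURATION = MappingProxyType(_DEFAULTS)

# Reads work exactly as before:
print(DEFAULT_CONFIGURATION["epoch_count"])

# Writes through the proxy raise TypeError instead of silently mutating:
try:
    DEFAULT_CONFIGURATION["epoch_count"] = 1  # type: ignore[index]
except TypeError as error:
    print("blocked:", error)

# Callers who need to tweak values obtain a mutable copy, as suggested:
mutable = dict(DEFAULT_CONFIGURATION)
mutable["epoch_count"] = 1
print(mutable["epoch_count"])
```

Note the proxy is a view over `_DEFAULTS`, so the underlying dict should stay private to the module; code holding the proxy alone cannot mutate the defaults.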
---
Duplicate comments:
In `@maskfile.md`:
- Around line 220-221: The ECS waiter timeout is too short for service
stabilization; update the aws ecs wait services-stable invocation (the command
using "$cluster" and "$service") to increase the waiter-config Delay/MaxAttempts
values so the total wait is much longer (e.g., Delay 15 and MaxAttempts 40 for
~10 minutes) to avoid flaky deploy failures during rolling updates; modify the
waiter-config JSON accordingly to reflect the new Delay and MaxAttempts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: aab16c18-6084-48fe-bf77-573c8021d82a
📒 Files selected for processing (5)
- .github/workflows/launch_infrastructure.yaml
- maskfile.md
- models/tide/src/tide/workflow.py
- models/tide/tests/test_deploy.py
- models/tide/tests/test_workflow.py
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.claude/commands/update-pull-request.md (1)
32-74: ⚠️ Potential issue | 🟠 Major

Add strict error handling to all bash blocks to catch silent failures in jq transformations.
Blocks 1 and 3 execute external commands (gh api, jq) without error checking. Block 2 validates Block 1's output, but Block 3's jq transformations produce files with no downstream validation. If jq fails on malformed input or file I/O issues, the output files are empty or invalid, and downstream steps (lines 164+, 335–336) proceed silently with corrupted data.
Add `set -euo pipefail` at the start of each bash block (lines 32, 78, and 103) to exit immediately on command failure.

Suggested fix

```diff
+set -euo pipefail
 # Set up scratchpad directory with session isolation
 : "${TMPDIR:=/tmp}"
```

```diff
+set -euo pipefail
 SCRATCHPAD="<value printed above>"
 if ! command -v jq >/dev/null 2>&1; then
```

```diff
+set -euo pipefail
 SCRATCHPAD="<value printed above>"
 echo "Extracting structured data from raw responses..."
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/commands/update-pull-request.md around lines 32 - 74, The script currently runs gh and jq invocations (e.g., gh api calls that write pr_meta_raw.json, pr_comments_raw.json, pr_reviews_raw.json, pr_review_comments_raw.json, check_runs_raw.json and the INLINE_QUERY GraphQL call that writes thread_ids_raw.json) without strict failure handling; add set -euo pipefail at the top of each bash block that performs external commands (the scratchpad creation block that sets SCRATCHPAD and the blocks that call gh/jq and that perform jq transformations) so the script exits on errors/undefined variables and failed pipes, and also ensure any jq transformations that write derived files validate input (e.g., check files exist and non-empty after jq runs) before proceeding; update blocks around SCRATCHPAD creation, the gh api fetches, and the jq processing/INLINE_QUERY usage accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/commands/update-pull-request.md:
- Around line 307-330: The script currently uses hardcoded placeholder thread
IDs in the threads list and calls urllib.request.urlopen without a timeout or
error checking; update it to read real thread IDs from the scratchpad file
instead of the hardcoded "threads" list, add a timeout to the network call (use
the timeout parameter on urllib.request.urlopen), and validate the GraphQL
response by checking for an "errors" key and HTTP error conditions before using
data; also wrap the token retrieval, network call (urllib.request.urlopen) and
JSON parsing in try/except to log/print failures and continue processing other
thread IDs so one error doesn’t hang or abort the whole script.
- Around line 56-60: The REST gh api calls that save PR data (the commands
writing to "${SCRATCHPAD}/pr_meta_raw.json",
"${SCRATCHPAD}/pr_comments_raw.json", "${SCRATCHPAD}/pr_reviews_raw.json", and
"${SCRATCHPAD}/pr_review_comments_raw.json") only fetch the first page; add the
--paginate flag to each of those gh api invocations so all pages are retrieved.
For the GraphQL query that extracts review threads (the block capped at 100),
implement cursor-based pagination (use endCursor/hasNextPage and loop until
hasNextPage is false) so all threads are fetched, and when using jq to write
combined outputs, add --slurp to produce a single JSON array to avoid fragmented
output. Ensure the same --paginate and pagination approach is applied to the
other REST/GraphQL review/comments fetches referenced in the file (including the
review-threads section).
---
Outside diff comments:
In @.claude/commands/update-pull-request.md:
- Around line 32-74: The script currently runs gh and jq invocations (e.g., gh
api calls that write pr_meta_raw.json, pr_comments_raw.json,
pr_reviews_raw.json, pr_review_comments_raw.json, check_runs_raw.json and the
INLINE_QUERY GraphQL call that writes thread_ids_raw.json) without strict
failure handling; add set -euo pipefail at the top of each bash block that
performs external commands (the scratchpad creation block that sets SCRATCHPAD
and the blocks that call gh/jq and that perform jq transformations) so the
script exits on errors/undefined variables and failed pipes, and also ensure any
jq transformations that write derived files validate input (e.g., check files
exist and non-empty after jq runs) before proceeding; update blocks around
SCRATCHPAD creation, the gh api fetches, and the jq processing/INLINE_QUERY
usage accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 30d71002-9f0d-4c39-ac89-07d5835a35b5
📒 Files selected for processing (1)
.claude/commands/update-pull-request.md
Pull request overview
This PR consolidates model-training code and modernizes infrastructure/image handling by standardizing service naming, ECR repositories, and ECS deployment workflows (aligned with the newer Prefect pattern).
Changes:
- Moved TiDE `train_model` (and its default configuration) into `tide.workflow`, updating tests accordingly and removing `tide.trainer`.
- Standardized ECS/ECR naming (notably "*-server" suffixes) and consolidated model-trainer images into a single ECR repository.
- Updated automation: adjusted Prefect deployment schedule (weekdays + timezone), removed scheduled infra teardown/data sync workflows, and added a CLI deploy command for ECS services.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| models/tide/src/tide/workflow.py | Adds DEFAULT_CONFIGURATION + train_model to workflow module; removes deferred import from old trainer. |
| models/tide/src/tide/trainer.py | Removes the old training module (now consolidated into workflow). |
| models/tide/tests/test_workflow.py | Updates patching target and adds unit tests for train_model now in tide.workflow. |
| models/tide/tests/test_trainer.py | Removes trainer-focused tests (moved into workflow tests). |
| models/tide/src/tide/deploy.py | Updates Prefect deployment schedule (weekday cron + timezone). |
| models/tide/tests/test_deploy.py | Aligns tests with updated cron + timezone. |
| maskfile.md | Reworks image build/push semantics around package_name, adds image deploy, and updates model commands. |
| .github/workflows/launch_infrastructure.yaml | Updates CI to build/push and now attempts ECS deploy per matrix entry. |
| .github/workflows/teardown_infrastructure.yaml | Removes scheduled teardown (manual only). |
| .github/workflows/sync_data.yaml | Removes scheduled data-sync workflow. |
| infrastructure/storage.py | Renames ECR repos/lifecycle policies; consolidates model-trainer repo naming. |
| infrastructure/compute.py | Renames ECS resources, service discovery names, log groups, and Prefect state/broker resources. |
| infrastructure/__main__.py | Updates Pulumi exports to match new repo/service names and URLs. |
| libraries/python/tests/test_infrastructure_storage.py | Updates assertions for renamed lifecycle policy resources. |
| models/Dockerfile | Renames build stage to server-worker. |
| models/tide/Dockerfile | Renames CUDA stage from trainer to runner. |
| infrastructure/github_environment_runbook.md | Documents automated population of env secrets via bootstrap. |
| .claude/commands/update-pull-request.md | Updates operational guidance to prefer REST + Python over MCP GitHub tools. |
…resolve script, Pulumi stack selection

- Add --paginate flag to REST array endpoint fetches in update-pull-request.md, piping through jq -s 'add' to flatten multi-page responses into a single array
- Replace hardcoded placeholder thread IDs in Python resolve script with automatic loading from thread_ids.json in scratchpad; add timeout=30 and GraphQL error checking
- Add pulumi org get-default + pulumi stack select pattern to mask infrastructure image deploy command, matching the existing convention used in other mask commands
- Add PULUMI_ACCESS_TOKEN env var to the deploy step in launch_infrastructure.yaml so Pulumi is authenticated when the deploy command runs in CI

Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/launch_infrastructure.yaml:
- Around line 74-83: The pre-stack ECS service deploy step (the step named
"Deploy ${{ matrix.application }}-${{ matrix.stage }}" that runs "mask
infrastructure image deploy") can fail if services or infra are renamed; reorder
the workflow so the Pulumi stack operation ("mask infrastructure stack up" run
in the launch_infrastructure job) happens before any per-service deploys, i.e.
make the "launch_infrastructure" job run immediately after build/push and have
the matrix/service deploy steps depend on its completion (or move "mask
infrastructure image deploy" into a job that lists launch_infrastructure as a
prerequisite), ensuring "mask infrastructure image deploy" only runs after "mask
infrastructure stack up".
- Around line 4-7: The workflow currently triggers on every push to master;
narrow the trigger by adding on.push.paths to only infra/service-relevant
patterns (e.g. infrastructure/**, pulumi/**, deploy/**, charts/**,
scripts/pulumi/**) and/or on.push.paths-ignore for docs/metadata (e.g. **/*.md)
so the Pulumi job only runs for relevant changes; update the YAML's on.push
block (keeping branches: [master]) to include these path patterns to reduce
unnecessary full infra runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 638dec68-1539-4040-8435-6934342165a8
📒 Files selected for processing (3)
- .claude/commands/update-pull-request.md
- .github/workflows/launch_infrastructure.yaml
- maskfile.md
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/launch_infrastructure.yaml:
- Around line 98-154: The deploy_images job duplicates the matrix and path
filter definitions; consolidate by extracting the matrix and path-filter config
into YAML anchors (e.g., define a top-level anchor like &matrix-config for the
matrix entries and &service-paths for the paths list) and then reference them in
deploy_images (strategy.matrix: <<: *matrix-config) and in the Check for service
changes step (use the anchored list for the filters input), ensuring job name
deploy_images and the paths-filter step id changes still consume the anchored
values; alternatively create a reusable workflow that accepts the matrix as an
input and have both deploy_images and build_and_push_images call it.
📒 Files selected for processing (1)
.github/workflows/launch_infrastructure.yaml
Pull request overview
This PR standardizes Docker/ECR/ECS naming and deployment mechanics across the repo, consolidates TiDE training code into the workflow module, and updates CI/runbooks to match the new infrastructure + Prefect patterns.
Changes:
- Consolidates TiDE model training logic into `tide.workflow` and migrates unit tests accordingly.
- Standardizes ECR repo names / ECS service names (notably adding consistent `-server` suffixes and consolidating model-trainer images).
- Reworks `mask` infra commands and GitHub Actions workflow steps for build/push/deploy, while removing scheduled data-sync and infra scheduling.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| models/tide/tests/test_workflow.py | Adds coverage for train_model now located in tide.workflow; fixes patch target. |
| models/tide/tests/test_trainer.py | Removes trainer-focused tests after moving training logic. |
| models/tide/tests/test_deploy.py | Updates expected Prefect deployment schedule (weekdays + timezone). |
| models/tide/src/tide/workflow.py | Introduces DEFAULT_CONFIGURATION + train_model and removes tide.trainer dependency from the flow. |
| models/tide/src/tide/trainer.py | Removes the standalone trainer module (training now in workflow). |
| models/tide/src/tide/deploy.py | Deploys Prefect schedule with weekday cron and explicit timezone. |
| models/tide/Dockerfile | Renames build stage to runner to match “tide runner” semantics. |
| models/Dockerfile | Renames stage to server-worker for combined model-trainer image. |
| maskfile.md | Consolidates image build/push logic, adds ECS deploy command, updates model commands and service naming. |
| libraries/python/tests/test_infrastructure_storage.py | Updates assertions for new lifecycle policy resource names. |
| infrastructure/storage.py | Renames ECR repos/lifecycle policies; consolidates model-trainer to a single repo. |
| infrastructure/github_environment_runbook.md | Updates runbook to reflect bootstrap-driven secret population. |
| infrastructure/compute.py | Renames ECS services, SD names, log groups, and Prefect state/broker resources to new naming scheme. |
| infrastructure/__main__.py | Updates Pulumi exports for renamed repos/images and new service discovery hostname. |
| .github/workflows/teardown_infrastructure.yaml | Removes scheduled teardown (now manual only). |
| .github/workflows/sync_data.yaml | Removes the sync-data workflow. |
| .github/workflows/launch_infrastructure.yaml | Updates build/push/deploy flow and matrix naming; removes schedule/manual triggers. |
| .claude/commands/update-pull-request.md | Updates automation guidance to use REST + Python for thread resolution reliability. |
Pull request overview
This PR cleans up build/deploy and infrastructure naming, consolidates TiDE model-training code into the workflow module, and updates CI/CD + tooling to match the new ECS/model-trainer layout and Prefect scheduling pattern.
Changes:
- Consolidate TiDE training logic/tests by moving `train_model` + defaults into `tide.workflow` and updating tests accordingly.
- Rename/standardize infrastructure resources (ECR repos, ECS services, service discovery names) and update workflows/tooling to build/push/deploy by `package_name` + `stage_name`.
- Adjust operational automation: remove scheduled sync/teardown workflows and change training deployment to weekday-only with an explicit timezone.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| models/tide/tests/test_workflow.py | Adds `train_model` unit tests and updates patching/imports after trainer consolidation. |
| models/tide/tests/test_trainer.py | Removes trainer-specific tests (moved into workflow tests). |
| models/tide/tests/test_deploy.py | Updates expectations for weekday cron + timezone in Prefect deployment. |
| models/tide/src/tide/workflow.py | Adds `DEFAULT_CONFIGURATION` + `train_model` implementation; removes import dependency on removed trainer module. |
| models/tide/src/tide/trainer.py | Removes standalone trainer module (logic moved into workflow). |
| models/tide/src/tide/deploy.py | Updates Prefect deployment schedule (weekday-only) and sets timezone. |
| models/tide/Dockerfile | Renames the CUDA stage target from `trainer` to `runner`. |
| models/Dockerfile | Renames stage target from `worker` to `server-worker` for the consolidated model-trainer image. |
| maskfile.md | Renames/expands image commands (build/push/deploy), standardizes naming, and updates Pulumi import hints. |
| libraries/python/tests/test_infrastructure_storage.py | Updates infra snapshot expectations for renamed ECR lifecycle resources. |
| infrastructure/storage.py | Changes S3 bucket resource types/names and standardizes ECR repository/lifecycle resources. |
| infrastructure/github_environment_runbook.md | Updates runbook to reflect automated secret population via `mask ... --bootstrap`. |
| infrastructure/compute.py | Renames ECS services/log groups/service discovery and refactors Prefect backing resources to "model-trainer-*". |
| infrastructure/__main__.py | Updates exported outputs for renamed ECR resources and training API URL. |
| .github/workflows/teardown_infrastructure.yaml | Removes scheduled teardown; keeps manual-only runs. |
| .github/workflows/sync_data.yaml | Removes automated data sync workflow. |
| .github/workflows/launch_infrastructure.yaml | Reworks to matrix build/push + separate deploy step; removes scheduled launch. |
| .claude/commands/update-pull-request.md | Updates internal PR-update instructions to prefer REST + `gh api` and Python for GraphQL mutations. |
Comments suppressed due to low confidence (2)
infrastructure/compute.py:383
- Renaming the RDS instance resource and changing `identifier` will force a brand-new database to be created (the old one cannot be renamed in-place), which risks downtime/data loss for Prefect unless there is an explicit migration plan. If the intent is a logical rename only, keep the existing identifier and add `ResourceOptions(aliases=[...])` to adopt the prior Pulumi resource; if a new DB is intended, document the migration/backup steps and how the new secret ARN will be wired into consumers.
```python
# RDS PostgreSQL for Prefect database
model_trainer_state = aws.rds.Instance(
    "model_trainer_state",
    identifier="fund-model-trainer-state",
    engine="postgres",
    engine_version="14",
    instance_class="db.t3.micro",
    allocated_storage=20,
    db_name="prefect",
    username="prefect",
    manage_master_user_password=True,
    db_subnet_group_name=prefect_rds_subnet_group.name,
    vpc_security_group_ids=[prefect_rds_security_group.id],
    skip_final_snapshot=False,
    final_snapshot_identifier=f"fund-model-trainer-state-final-{pulumi.get_stack()}",
    backup_retention_period=7,
    storage_encrypted=True,
    deletion_protection=True,
    tags=tags,
)
```
infrastructure/compute.py:426
- Changing the ElastiCache `cluster_id` (and Pulumi logical name) will replace the Redis cluster. If existing environments depend on the current broker state/endpoint, add aliases and keep the existing `cluster_id`, or document the migration and ensure all clients (Prefect server/worker) are updated to the new endpoint as part of the same rollout.
```python
# ElastiCache Redis for Prefect messaging
model_trainer_broker = aws.elasticache.Cluster(
    "model_trainer_broker",
    cluster_id="fund-model-trainer-broker",
    engine="redis",
    engine_version="7.0",
    node_type="cache.t3.micro",
    num_cache_nodes=1,
    subnet_group_name=prefect_elasticache_subnet_group.name,
    security_group_ids=[prefect_redis_security_group.id],
    tags=tags,
)
```
…nsitive dependency
Pull request overview
This PR consolidates Docker image handling and deployment workflows, renames/standardizes ECS + ECR/Pulumi resources (notably the model-trainer services), and moves TiDE training logic into the workflow module while adjusting Prefect deployment scheduling.
Changes:
- Reworked Mask + GitHub Actions workflows for build/push/deploy, removed scheduled sync/teardown automation, and added an explicit ECS deploy command.
- Renamed and consolidated AWS/Pulumi resources (ECR repos, ECS services, service discovery, logs, Prefect DB/Redis) to consistent `*-server` / `model-trainer-*` naming.
- Moved TiDE `train_model` + default config into `tide.workflow`, updated tests, and adjusted training deployment schedule to weekdays with an explicit timezone.
Reviewed changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| uv.lock | Adds `pytz` to the lockfile dependency set. |
| models/tide/tests/test_workflow.py | Updates workflow tests and adds coverage for the relocated `train_model`. |
| models/tide/tests/test_trainer.py | Removes trainer tests now that training logic moved out of `tide.trainer`. |
| models/tide/tests/test_deploy.py | Updates expected Prefect cron schedule + timezone. |
| models/tide/src/tide/workflow.py | Introduces `DEFAULT_CONFIGURATION` + `train_model` and removes import from the deleted trainer module. |
| models/tide/src/tide/trainer.py | Deletes the old trainer module (training logic relocated). |
| models/tide/src/tide/deploy.py | Deploys Prefect flow on weekdays and sets timezone. |
| models/tide/Dockerfile | Renames the CUDA stage to `runner` (aligns with "tide runner" usage). |
| models/Dockerfile | Renames the runtime stage to `server-worker` for the consolidated model-trainer image target. |
| maskfile.md | Consolidates image commands under `infrastructure image`, adds deploy, updates Pulumi imports, and adds model command routing. |
| libraries/python/tests/test_infrastructure_storage.py | Updates infra snapshot expectations for new ECR repo lifecycle resource names. |
| infrastructure/storage.py | Updates S3 bucket resources and standardizes ECR repository naming + lifecycle resources. |
| infrastructure/notifications.py | Documents Pulumi config variable controlling monthly budget. |
| infrastructure/github_environment_runbook.md | Updates runbook to reflect automated env-secret population via `mask ... --bootstrap`. |
| infrastructure/compute.py | Renames ECS services/SD/log groups and consolidates server+worker to a single model-trainer image URI; tweaks ALB config. |
| infrastructure/__main__.py | Updates Pulumi stack outputs to reflect renamed repos/images and training API hostname. |
| infrastructure/Pulumi.production.yaml | Raises `monthlyBudgetLimitUsd` config value. |
| applications/portfolio_manager/pyproject.toml | Adds `pytz` dependency. |
| applications/ensemble_manager/Dockerfile | Copies `tools/` into the image (supports shared tooling usage). |
| applications/data_manager/Dockerfile | Sets `CARGO_BUILD_JOBS=4` during the build stage. |
| .github/workflows/teardown_infrastructure.yaml | Removes scheduled teardown; keeps manual workflow dispatch. |
| .github/workflows/sync_data.yaml | Removes the sync-data workflow. |
| .github/workflows/launch_infrastructure.yaml | Switches to a matrix build/push + deploy workflow; removes schedule trigger. |
| .claude/commands/update-pull-request.md | Updates internal PR-update automation guidance to prefer REST + Python for thread resolution. |
Comments suppressed due to low confidence (1)
models/Dockerfile:24
- The `server-worker` stage is now the canonical build target for the model-trainer image, but it hardcodes an ENTRYPOINT that starts a Prefect worker. Since the ECS model-trainer server task definition uses this same image and tries to override behavior via `command`, the server container may not start correctly (ECS `command` won't replace ENTRYPOINT). Consider using `CMD` (so ECS can override), or provide separate `server` and `worker` targets/images.
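The `CMD`-based approach suggested here could look like the following sketch. The base image, stage contents, and worker pool name are illustrative assumptions, not the actual Dockerfile.

```dockerfile
# Sketch: use CMD instead of ENTRYPOINT so an ECS `command` override works.
FROM python:3.12-slim AS server-worker
# ... copy application code and install dependencies here ...

# Default to the worker process. The server task definition can override this
# via its container `command`, which maps to Docker CMD — not ENTRYPOINT.
CMD ["prefect", "worker", "start", "--pool", "model-trainer"]
```

Because ECS `command` replaces CMD while leaving ENTRYPOINT in place, dropping the ENTRYPOINT lets one image serve both the server and worker task definitions.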
Overview
Changes
Context
Some additional fixes given the new Prefect pattern.
Summary by CodeRabbit
Chores
New Features
Behavior Changes